A Combined Theory Data-Driven Approach to Classifying Delinquency Risk in the Future of Families and Child Wellbeing Study

About Me


Name: Nicholas Vietto


PhD Candidate at the University of Nebraska - Omaha


Research Interests: Biopsychosocial Criminology, Quantitative Methods, Data Visualization, Open Science, Open-Source Software


Introduction


Goal:

Classify delinquency risk at age 15 using data from ages 9 and 15 in the Future of Families and Child Wellbeing Study (FFCWS).


How:

Building on the findings and method of Chan et al. (2023), we implement a feed-forward neural network using the {tidymodels} framework in R.

Chan and Colleagues (2023)




A model combining factors from the social, psychological, and biological domains outperformed models using any single domain, predicting a conduct disorder (CD) diagnosis with 91.18% accuracy.

Extending Chan and Colleagues (2023)


Using the Future of Families and Child Wellbeing Study (FFCWS):


  • Expanded Sociological Domain: Incorporates rich socio-environmental predictors, including census tract variables, labor market measures, and proximity to gun-violence incidents.

  • Incorporating Genetic Data: Includes genes involved in the serotonergic and dopaminergic pathways to examine the role of polymorphic variation.

  • Classifying Delinquency Risk rather than a CD diagnosis.

Future of Families and Child Wellbeing Study (FFCWS)

The Great Theory Bake Off


Socio-Environmental Domain

  • Parental Monitoring Scale (Focal Child, Year 15)

  • Neighborhood Collective Efficacy Scale (Focal Child, Year 15)

  • Conflict Tactics Scale (Focal Child, Year 15)

  • Material Hardship Scale (PCG, Year 15)

Psychological Domain

  • BSI 18 Anxiety Scale (Focal Child, Year 15)

  • Center for Epidemiologic Studies Depression Scale (CES-D) (Focal Child, Year 15)

  • Dickman’s Impulsivity Scale (Focal Child, Year 15)

Genetic Domain

  • SLC6A4 Gene (Serotonin Transporter Gene)
    • 5-HTTLPR
    • STin2
  • TPH2 Gene (Tryptophan Hydroxylase 2 Gene)
    • rs4570625
    • rs1386494

Why Machine Learning?

Improvements in Empirical Analysis


  • Advanced Data Processing: Efficiently handles and analyzes large amounts of data to enhance predictive power.

  • Uncovering Complex Relationships: Identifies non-linear and higher-order interactions, especially in high-dimensional datasets (e.g., image or audio data), providing deeper insight into relationships among variables.

  • Enhanced Predictive Accuracy: Iteratively adjusts model parameters during training to reduce prediction error.

  • Further Reading: Mapping of machine learning approaches for description, prediction, and causal inference in the social and health sciences

Feed Forward Neural Networks

Machine Learning Workflow using the {tidymodels} framework in R

Machine Learning Workflow

Data Spending


60/20/20 split for training, validation, and testing.

2128 observations after merging data

  • 1276 for training
  • 426 for validation
  • 426 for testing
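
The split sizes above can be reproduced with a short sketch. It is written in Python for illustration only, not the R/{tidymodels} code used in the talk, and it assumes the training set is taken first (rounded down) and the remainder is divided evenly between validation and testing. The function name and the seed are hypothetical, and any stratification the original workflow may have used is not shown.

```python
import random

def three_way_split(n_rows, train_prop=0.6, seed=2024):
    """Shuffle row indices, take train_prop for training, then halve the rest.

    Assumption: this two-step rule is one way to arrive at the 1276/426/426
    counts on the slide; the talk does not state the exact procedure.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)      # hypothetical seed, for reproducibility

    n_train = int(n_rows * train_prop)        # rounds down: 2128 * 0.6 -> 1276
    n_val = (n_rows - n_train) // 2           # half of the remaining 852 rows

    train = indices[:n_train]
    validation = indices[n_train:n_train + n_val]
    test = indices[n_train + n_val:]
    return train, validation, test

train_idx, val_idx, test_idx = three_way_split(2128)
print(len(train_idx), len(val_idx), len(test_idx))  # 1276 426 426
```

The three index sets are disjoint and together cover all 2128 rows, so no observation is shared between training, validation, and testing.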

Sample Descriptive Statistics

Predictor Descriptive Statistics

Preliminary Results

Limitations


  • Genetic Data Constraints: Genetic information is confined to markers from the candidate gene era, potentially limiting genomic coverage.

  • Sample Size: The sample is small relative to typical machine learning applications, which may limit the robustness and generalizability of the results.

  • Age of Assessment: Age 15 may be early for assessing delinquency risk, as behaviors predictive of long-term patterns may not yet be fully evident.

Future Directions


  • Enhance Domain Optimization: Add features to maximize the model’s performance in each specific domain (e.g., adding labor market measures as distal predictors in the sociological domain).

  • Evaluate Fairness Across Ethnicities: Compare the final model’s performance across ethnic groups to check that it is not biased against any social or minority group.

  • Test model on Year 22 data: Validate the model’s performance on the Year 22 data to assess its generalizability and predictive power.
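
One simple way to start the fairness check listed above is to compute accuracy separately for each group and look at the gap. The sketch below is in Python for illustration; the function name, the group labels "A" and "B", and the toy labels are all hypothetical and are not FFCWS data or results. Accuracy is only one possible criterion, and a fuller audit would likely also compare error types across groups.

```python
from collections import defaultdict

def accuracy_by_group(y_true, y_pred, groups):
    """Return {group: accuracy} for parallel lists of labels, predictions, groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        correct[group] += int(truth == pred)
    return {group: correct[group] / total[group] for group in total}

# Made-up toy data, for illustration only (not study results).
y_true = [1, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 0, 1, 0, 1, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]

per_group = accuracy_by_group(y_true, y_pred, groups)
gap = max(per_group.values()) - min(per_group.values())
print(per_group, gap)
```

A large gap between groups would flag the model for closer inspection before it is applied to the Year 22 data.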

Questions?

Supplemental Materials

The Tale of Two Cultures

Data Modeling Culture


Primary Goal: Drawing causal inferences

Approach: Emphasizes deductive reasoning

Process: Models the data-generating process to clarify relationships between X and Y

Culture: Grounded in methodologies developed primarily by statisticians

Algorithm Modeling Culture


Primary Goal: Maximizing predictive accuracy

Approach: Emphasizes inductive reasoning, with a focus on learning patterns directly from data

Process: Utilizes black-box models to capture relationships between X and Y

Culture: Rooted in methodologies developed primarily by computer scientists

Neural Networks

Confusion Matrix (All Domains)
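
For readers unfamiliar with how a confusion matrix is summarized, the sketch below derives accuracy, sensitivity, and specificity from the four cells of a binary confusion matrix. It is written in Python for illustration, and the counts are placeholders chosen for easy arithmetic; they are not the results of the all-domains model. The function name is hypothetical.

```python
def confusion_metrics(tp, fp, fn, tn):
    """Summarize a 2x2 confusion matrix.

    tp/fn: positive cases predicted as positive/negative.
    tn/fp: negative cases predicted as negative/positive.
    """
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,    # share of all cases classified correctly
        "sensitivity": tp / (tp + fn),    # share of true positives that were caught
        "specificity": tn / (tn + fp),    # share of true negatives correctly cleared
    }

# Placeholder counts, NOT study results.
metrics = confusion_metrics(tp=40, fp=10, fn=20, tn=30)
print(metrics)
```

Reporting sensitivity and specificity alongside accuracy matters when the classes are imbalanced, because accuracy alone can look high for a model that rarely predicts the minority class.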

Class Balance Check



Activation Functions
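
As a reference for this slide, three activation functions commonly used in feed-forward networks can be written in a few lines. This is a Python sketch using the standard definitions; the slides do not say which activation the final model used, so nothing here should be read as the talk's configuration.

```python
import math

def sigmoid(x):
    """Logistic function: squashes any real input into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def relu(x):
    """Rectified linear unit: passes positive inputs, zeroes out the rest."""
    return max(0.0, x)

def tanh(x):
    """Hyperbolic tangent: squashes any real input into (-1, 1)."""
    return math.tanh(x)

for value in (-2.0, 0.0, 2.0):
    print(value, sigmoid(value), relu(value), tanh(value))
```

The sigmoid is a natural choice for the output unit of a binary classifier because its output can be read as a probability, while ReLU and tanh are typical choices for hidden layers.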